Abstract

We present a novel multilabel/ranking algorithm working in partial information
settings. The algorithm is based on 2nd-order descent methods, and relies
on upper-confidence bounds to trade off exploration and exploitation. We analyze
this algorithm in a partial adversarial setting, where covariates can be adversarial,
but multilabel probabilities are ruled by (generalized) linear models. We show
$O(T^{1/2} \log T)$ regret bounds, which improve in several ways on the existing
results. We test the effectiveness of our upper-confidence scheme by contrasting
against full-information baselines on real-world multilabel datasets, often obtain- 
ing comparable performance. 

1 Introduction 

Consider a book recommendation system. Given a customer's profile, the system recommends
a few possible books to the user by means of, e.g., a limited number of
banners placed at different positions on a webpage. The system's goal is to select 
books that the user likes and possibly purchases. Typical feedback in such systems is 
the actual action of the user or, in particular, what books he has bought/preferred, if 
any. The system cannot observe what would have been the user's actions had other 
books got recommended, or had the same book ads been placed in a different order 
within the webpage. Such problems are collectively referred to as learning with par- 
tial feedback. As opposed to the full information case, where the system (the learning 
algorithm) knows the outcome of each possible response (e.g., the user's action for 
each and every possible book recommendation placed in the largest banner ad), in the 
partial feedback setting, the system only observes the response to very limited options 
and, specifically, the option that was actually recommended. In this and many other 
examples of this sort, it is reasonable to assume that recommended options are not
given the same treatment by the system, e.g., large banners which are displayed on top 
of the page should somehow be more committing as a recommendation than smaller 
ones placed elsewhere. Moreover, it is often plausible to interpret the user feedback as 
a preference (if any) restricted to the displayed alternatives. 

We consider instantiations of this problem in the multilabel and learning-to-rank
settings. Learning proceeds in rounds: in each time step $t$ the algorithm receives an
instance $x_t$ and outputs an ordered subset $\hat{Y}_t$ of labels from a finite set of possible labels
$[K] = \{1, 2, \ldots, K\}$. Restrictions might apply to the size of $\hat{Y}_t$ (due, e.g., to the
number of available slots in the webpage). The set $\hat{Y}_t$ corresponds to the aforementioned
recommendations, and is intended to approximate the true set of preferences associated
with $x_t$. However, the latter set is never observed. In its stead, the algorithm receives
$Y_t \cap \hat{Y}_t$, where $Y_t \subseteq [K]$ is a noisy version of the true set of user preferences on $x_t$.
When we are restricted to $|\hat{Y}_t| = 1$ for all $t$, this becomes a multiclass classification
problem with bandit feedback (see below).

Related work. This paper lies at the intersection between online learning with 
partial feedback and multilabel classification/ranking. Both fields include a substantial 
amount of work, so we can hardly do it justice here. We outline some of the main 
contributions in the two fields, with an emphasis on those we believe are the most 
related to this paper.

A well-known and standard tool for facing the problem of partial feedback in online
learning is to trade off exploration and exploitation through upper confidence
bounds [20]. In the so-called bandit setting with contextual information (sometimes
called bandits with side information or bandits with covariates, e.g., [2, 6, 9, 5, 19],
and references therein) an online algorithm receives at each time step a context
(typically, in the form of a feature vector $x$) and is compelled to select an action (e.g., a
label), whose goodness is quantified by a predefined loss function. Full information
about the loss function is not available. The specifics of the interaction model
determine which pieces of loss will be observed by the algorithm, e.g., the actual value of
the loss on the chosen action, some information on more profitable directions on the
action space, noisy versions thereof, etc. The overall goal is to compete against classes
of functions that map contexts to (expected) losses in a regret sense, that is, to obtain
sublinear cumulative regret bounds. For instance, [5, 6, 9, 1] work in a finite action
space where the mappings context-to-loss for each action are linear (or generalized
linear, as in [9]) functions of the features. They all obtain $T^{1/2}$-like regret bounds, where
$T$ is the time horizon. This is extended in [19], where the loss function is modeled as
a sample from a Gaussian process over the joint context-action space. We are using a
similar (generalized) linear modeling here. Linear multiclass classification problems
with bandit feedback are considered in, e.g., [16, 5, 14], where either $T^{2/3}$ or $T^{1/2}$
or even logarithmic regret bounds are proven, depending on the noise model and the
underlying loss functions.

None of the above papers considers structured action spaces, where the learner
is allowed to select sets of actions, which is more suitable to multilabel and ranking
problems. Along these lines are the papers [13, 27, 18, 25, 24]. The general problem
of online minimization of a submodular loss function under both full and bandit
information without covariates is considered in [13], achieving a regret $T^{2/3}$ in the
bandit case. In [27] the problem of online learning of assignments is considered, where
at each round an algorithm is requested to assign positions (e.g., rankings) to sets of
items (e.g., ads) with given constraints on the set of items that can be placed in each
position. Their problem shares similar motivations as ours but, again, the bandit version
of their algorithm does not explicitly take side information into account, and leads to a
$T^{2/3}$ regret bound. In [18] the aim is to learn a suitable ordering (an "ordered slate")
of the available actions. Among other things, the authors prove a $T^{1/2}$ regret bound
in the bandit setting with a multiplicative weight updating scheme. Yet, no contextual
information is incorporated. In [25] the ability of selecting sets of actions is motivated
by a problem of diverse retrieval in large document collections which are meant to live
in a general metric space. In contrast to our paper, that approach does not lead to strong
regret guarantees for specific (e.g., smooth) loss functions. [24] uses a simple linear
model for the hidden utility function of users interacting with a web system and
providing partial feedback in any form that allows the system to make significant progress
in learning this function (this is called an $\alpha$-informative feedback by the authors). A
regret bound of $T^{1/2}$ is again provided that depends on the degree of informativeness of
the feedback. It is experimentally argued that this feedback is typically made available
by a user that clicks on relevant URLs out of a list presented by a search engine.
Despite the neatness of the argument, no formal effort is put into relating this information
to the context information at hand or to the way data are generated.

The literature on multilabel learning and learning to rank is overwhelming. The
wide attention this literature attracts is often motivated by its web-search-engine or
recommender-system applications, and many of the papers are experimental in nature.
Relevant references include [28, 11, 8], along with references therein. Moreover, when
dealing with multilabel, the typical assumption is full supervision, an important
concern being modeling correlations among classes. In contrast to that, the specific setting
we are considering here need not face such a modeling [8]. Other related references are
[15, 10], where learning is by pairs of examples. Yet, these approaches need i.i.d.
assumptions on the data, and typically deliver batch learning procedures. To summarize,
whereas we are technically close to [2, 6, 5, 9, 1, 19], from a motivational standpoint
we are perhaps closest to [27, 18, 24].

Our results. We investigate the multilabel and learning-to-rank problems in a partial
feedback scenario with contextual information, where we assume a probabilistic
linear model over the labels, although the contexts can be chosen by an adaptive
adversary. We consider two families of loss functions: one is a cost-sensitive multilabel
loss that generalizes the standard Hamming loss in several respects, the other is a kind
of (unnormalized) ranking loss. In both cases, the learning algorithm is maintaining a
(generalized) linear predictor for the probability that a given label occurs, the ranking
being produced by upper confidence-corrected estimated probabilities. In such settings,
we prove $T^{1/2} \log T$ cumulative regret bounds, which are essentially optimal (up to log
factors). A distinguishing feature of our user feedback model is that, unlike previous
papers (e.g., [13, 27, 1, 19]), we are not assuming the algorithm is observing a noisy
version of the risk function on the currently selected action. In fact, when a generalized
linear model is adopted, the mapping context-to-risk turns out to be nonconvex in the
parameter space. Furthermore, when operating on structured action spaces this more
traditional form of bandit model does not seem appropriate to capture the typical user
preference feedback. Our approach is based on having the loss decouple from the label
generating model, the user feedback being a noisy version of the gradient of a surrogate
convex loss associated with the model itself. As a consequence, the algorithm is not
directly dealing with the original loss when exploring. Though the emphasis
is on theoretical results, we also validate our algorithms on two real-world multilabel
datasets w.r.t. a number of loss functions, showing good comparative performance
against simple multilabel/ranking baselines that operate with full information.

2 Model and preliminaries 

We consider a setting where the algorithm receives at time $t$ the side information vector
$x_t \in \mathbb{R}^d$, is allowed to output a (possibly ordered) subset $\hat{Y}_t \subseteq [K]$ of the set of
possible labels, then the subset of labels $Y_t \subseteq [K]$ associated with $x_t$ is generated, and
the algorithm gets as feedback $\hat{Y}_t \cap Y_t$. The loss suffered by the algorithm may take
into account several things: the distance between $Y_t$ and $\hat{Y}_t$ (both viewed as sets), as
well as the cost for playing $\hat{Y}_t$. The cost $c(\hat{Y}_t)$ associated with $\hat{Y}_t$ might be given by
the sum of costs suffered on each class $i \in \hat{Y}_t$, where we possibly take into account
the order in which $i$ occurs within $\hat{Y}_t$ (viewed as an ordered list of labels). Specifically,
given constant $a \in [0, 1]$ and costs $c = \{c(i, s),\ i = 1, \ldots, s,\ s \in [K]\}$, such that
$1 \geq c(1, s) \geq c(2, s) \geq \ldots \geq c(s, s) > 0$, for all $s \in [K]$, we consider the loss function

$$\ell_{a,c}(Y_t, \hat{Y}_t) = a\,|Y_t \setminus \hat{Y}_t| + (1-a) \sum_{i \in \hat{Y}_t \setminus Y_t} c(j_i, |\hat{Y}_t|),$$

where $j_i$ is the position of class $i$ in $\hat{Y}_t$, and $c(j_i, \cdot)$ depends on $\hat{Y}_t$ only through its size
$|\hat{Y}_t|$. In the above, the first term accounts for the false negative mistakes, hence there
is no specific ordering of labels therein. The second term collects the loss contribution
provided by all false positive classes, taking into account through the costs $c(j_i, |\hat{Y}_t|)$
the order in which labels occur in $\hat{Y}_t$. The constant $a$ serves as weighting the relative
importance of false positive vs. false negative mistakes.$^1$ As a specific example, suppose
that $K = 10$, the costs $c(i, s)$ are given by $c(i, s) = (s - i + 1)/s$, $i = 1, \ldots, s$,
the algorithm plays $\hat{Y}_t = (4, 3, 6)$, but $Y_t$ is $\{1, 3, 8\}$. In this case, $|Y_t \setminus \hat{Y}_t| = 2$, and
$\sum_{i \in \hat{Y}_t \setminus Y_t} c(j_i, |\hat{Y}_t|) = 3/3 + 1/3$, i.e., the cost for mistakenly playing class 4 in the
top slot of $\hat{Y}_t$ is more damaging than mistakenly playing class 6 in the third slot. In the
special case when all costs are unitary, there is no longer need to view $\hat{Y}_t$ as an ordered
collection, and the above loss reduces to a standard Hamming-like loss between sets
$Y_t$ and $\hat{Y}_t$, i.e., $a\,|Y_t \setminus \hat{Y}_t| + (1-a)\,|\hat{Y}_t \setminus Y_t|$. Notice that the partial feedback $Y_t \cap \hat{Y}_t$
allows the algorithm to know which of the chosen classes in $\hat{Y}_t$ are good or bad (and
to what extent, because of the selected ordering within $\hat{Y}_t$). Yet, the algorithm does not
observe the value of $\ell_{a,c}(Y_t, \hat{Y}_t)$ because $Y_t \setminus \hat{Y}_t$ remains hidden.
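The worked example above can be checked with a few lines of code. The following is a minimal sketch, not part of the original algorithm: the function names `decreasing_cost` and `multilabel_loss` are hypothetical, and the played sequence is assumed to be given as a tuple with the top slot first.

```python
def decreasing_cost(i, s):
    """Cost c(i, s) = (s - i + 1) / s used in the worked example."""
    return (s - i + 1) / s


def multilabel_loss(true_labels, played, a, c=decreasing_cost):
    """Cost-sensitive loss between the true set Y_t and the played sequence.

    true_labels: set of relevant classes (Y_t).
    played: ordered tuple of classes output by the learner, top slot first.
    a: weight in [0, 1] on false negatives; (1 - a) weighs false positives.
    """
    true_set = set(true_labels)
    size = len(played)
    # False negatives: relevant classes that were not played (order-free).
    false_negatives = len(true_set - set(played))
    # Each false positive pays the cost of the slot it occupies.
    false_positive_cost = sum(
        c(position, size)
        for position, label in enumerate(played, start=1)
        if label not in true_set
    )
    return a * false_negatives + (1 - a) * false_positive_cost


# The example from the text: played (4, 3, 6), true set {1, 3, 8}.
example_loss = multilabel_loss({1, 3, 8}, (4, 3, 6), a=0.5)
```

With $a = 0.5$ this evaluates to $0.5 \cdot 2 + 0.5 \cdot (3/3 + 1/3)$; passing a constant cost `lambda i, s: 1.0` recovers the Hamming-like special case.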

Working with the above loss function makes the algorithm's output $\hat{Y}_t$ become a
ranked list of classes, where ranking is restricted to the deemed relevant classes only.
In our setting, only a relevance feedback among the selected classes is observed (the
set $Y_t \cap \hat{Y}_t$), but no supervised ranking information (e.g., in the form of pairwise
preferences) is provided to the algorithm within this set. Alternatively, we can think of a
ranking framework where restrictions on the size of $\hat{Y}_t$ are set by an exogenous (and
possibly time-varying) parameter of the problem, and the algorithm is required to
provide a ranking complying with these restrictions. More on the connection to the ranking
setting with partial feedback is in Sec. 4.

$^1$ Notice that $a$ is not redundant here, since the costs $c(i, s)$ have been normalized to $[0,1]$.

The problem arises as to which noise model we should adopt so as to encompass
significant real-world settings while at the same time affording efficient implementation
of the resulting algorithms. For any subset $Y_t \subseteq [K]$, we let $(y_{1,t}, \ldots, y_{K,t}) \in
\{0, 1\}^K$ be the corresponding indicator vector. Then it is easy to see that

$$\ell_{a,c}(Y_t, \hat{Y}_t) = a \sum_{i=1}^K y_{i,t} + (1-a) \sum_{i \in \hat{Y}_t} \left( c(j_i, |\hat{Y}_t|) - \left( \frac{a}{1-a} + c(j_i, |\hat{Y}_t|) \right) y_{i,t} \right).$$

Moreover, because the first sum does not depend on $\hat{Y}_t$, for the sake of optimizing over $\hat{Y}_t$ we can
equivalently define

$$\ell_{a,c}(Y_t, \hat{Y}_t) = (1-a) \sum_{i \in \hat{Y}_t} \left( c(j_i, |\hat{Y}_t|) - \left( \frac{a}{1-a} + c(j_i, |\hat{Y}_t|) \right) y_{i,t} \right). \qquad (1)$$

Let $P_t(\cdot)$ be a shorthand for the conditional probability $P_t(\cdot \mid x_t)$, where the side
information vector $x_t$ can in principle be generated by an adaptive adversary as a function
of the past. Then $P_t(y_{1,t}, \ldots, y_{K,t}) = P(y_{1,t}, \ldots, y_{K,t} \mid x_t)$, where the marginals
$P_t(y_{i,t} = 1)$ satisfy$^2$

$$P_t(y_{i,t} = 1) = \frac{g(-u_i^\top x_t)}{g(u_i^\top x_t) + g(-u_i^\top x_t)}, \qquad i = 1, \ldots, K, \qquad (2)$$

for some $K$ vectors $u_1, \ldots, u_K \in \mathbb{R}^d$ and some (known) function $g : D \subseteq \mathbb{R} \to \mathbb{R}^+$.
The model is well defined if $u_i^\top x \in D$ for all $i$ and all $x \in \mathbb{R}^d$ chosen by the adversary.
We assume for the sake of simplicity that $\|x_t\| = 1$ for all $t$. Notice that the variables
$y_{i,t}$ need not be conditionally independent. We are only defining a family of allowed
joint distributions $P_t(y_{1,t}, \ldots, y_{K,t})$ through the properties of their marginals $P_t(y_{i,t})$.

The function $g$ above will be instantiated to the negative derivative of a suitable
convex and nonincreasing loss function $L$ which our algorithm will be based upon. For
instance, if $L$ is the square loss $L(\Delta) = (1 - \Delta)^2/2$, then $g(\Delta) = 1 - \Delta$, resulting
in $P_t(y_{i,t} = 1) = (1 + u_i^\top x_t)/2$, under the assumption $D = [-1, 1]$. If $L$ is the
logistic loss $L(\Delta) = \ln(1 + e^{-\Delta})$, then $g(\Delta) = (e^{\Delta} + 1)^{-1}$, and $P_t(y_{i,t} = 1) =
e^{u_i^\top x_t}/(e^{u_i^\top x_t} + 1)$, with domain $D = \mathbb{R}$.
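The two instantiations of $g$ can be checked numerically against the closed forms just stated. This is a small sketch under the assumption that margins are plain floats; the helper names `g_square`, `g_logistic` and `label_probability` are illustrative only.

```python
import math


def g_square(delta):
    """g = -L' for the square loss L(delta) = (1 - delta)^2 / 2."""
    return 1.0 - delta


def g_logistic(delta):
    """g = -L' for the logistic loss L(delta) = ln(1 + exp(-delta))."""
    return 1.0 / (math.exp(delta) + 1.0)


def label_probability(g, margin):
    """Marginal P(y = 1) = g(-margin) / (g(margin) + g(-margin)), as in (2)."""
    return g(-margin) / (g(margin) + g(-margin))


# Square loss on D = [-1, 1] gives (1 + margin) / 2.
p_square = label_probability(g_square, 0.5)
# Logistic loss gives the usual sigmoid exp(margin) / (exp(margin) + 1).
p_logistic = label_probability(g_logistic, 2.0)
```

In both cases the probability is increasing in the margin, which is what later allows classes to be ranked by their (estimated) margins.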

Set for brevity $\Delta_{i,t} = u_i^\top x_t$. Taking into account (1), this model allows us to write
the (conditional) expected loss of the algorithm playing $\hat{Y}_t$ as

$$E_t[\ell_{a,c}(Y_t, \hat{Y}_t)] = (1-a) \sum_{i \in \hat{Y}_t} \left( c(j_i, |\hat{Y}_t|) - \left( \frac{a}{1-a} + c(j_i, |\hat{Y}_t|) \right) p_{i,t} \right), \qquad (3)$$

where $p_{i,t} = \frac{g(-\Delta_{i,t})}{g(\Delta_{i,t}) + g(-\Delta_{i,t})}$, and the expectation $E_t$ above is w.r.t. the generation of
labels $Y_t$, conditioned on both $x_t$, and all previous $x$ and $Y$. A key aspect of this
formalization is that the Bayes optimal ordered subset
$Y_t^* = \mathrm{argmin}_{Y = (j_1, j_2, \ldots, j_{|Y|}) \subseteq [K]} E_t[\ell_{a,c}(Y_t, Y)]$
can be computed efficiently when knowing $\Delta_{1,t}, \ldots, \Delta_{K,t}$. This is handled by the next
lemma. In words, this lemma says that, in order to minimize (3), it suffices to try out all
possible sizes $s = 0, 1, \ldots, K$ for $Y_t^*$ and, for each such value, determine the sequence
$Y^*_{s,t}$ that minimizes (3) over all sequences of size $s$. In turn, $Y^*_{s,t}$ can be computed just
by sorting classes $i \in [K]$ in decreasing order of $p_{i,t}$, sequence $Y^*_{s,t}$ being given by the
first $s$ classes in this sorted list.$^3$

$^2$ The reader familiar with generalized linear models will recognize the derivative of the function
$p(\Delta) = \frac{g(-\Delta)}{g(\Delta) + g(-\Delta)}$ as the (inverse) link function of the associated canonical exponential family of distributions.

Lemma 1. With the notation introduced so far, let $p_{i_1,t} \geq p_{i_2,t} \geq \ldots \geq p_{i_K,t}$ be the
sequence of $p_{i,t}$ sorted in nonincreasing order. Then we have that
$Y_t^* = \mathrm{argmin}_{s = 0, 1, \ldots, K} E_t[\ell_{a,c}(Y_t, Y^*_{s,t})]$,
where $Y^*_{s,t} = (i_1, i_2, \ldots, i_s)$, and $Y^*_{0,t} = \emptyset$.
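The computation described by Lemma 1 can be sketched as follows. This is an illustrative sketch rather than the authors' code: classes are 0-indexed here, $a < 1$ is assumed so that $a/(1-a)$ is finite, and the names `expected_loss`, `bayes_optimal_sequence` and `decreasing_cost` are hypothetical.

```python
def decreasing_cost(i, s):
    """Example cost c(i, s) = (s - i + 1) / s, nonincreasing in the slot i."""
    return (s - i + 1) / s


def expected_loss(sequence, p, a, c):
    """Expected loss (3) of playing `sequence` (top slot first)."""
    s = len(sequence)
    ratio = a / (1.0 - a)  # assumes a < 1
    return (1.0 - a) * sum(
        c(slot, s) - (ratio + c(slot, s)) * p[i]
        for slot, i in enumerate(sequence, start=1)
    )


def bayes_optimal_sequence(p, a, c=decreasing_cost):
    """Sort classes by decreasing p, then pick the best prefix length s."""
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    prefixes = [tuple(order[:s]) for s in range(len(p) + 1)]
    return min(prefixes, key=lambda seq: expected_loss(seq, p, a, c))


# Four classes with known marginals; a = 0.5 weighs both mistakes equally.
marginals = [0.9, 0.2, 0.6, 0.05]
best = bayes_optimal_sequence(marginals, a=0.5)
```

Only $K + 1$ candidate prefixes are evaluated, instead of every ordered subset of $[K]$.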

Notice the way costs $c(i, s)$ influence the Bayes optimal computation. We see from
(3) that placing class $i$ within $\hat{Y}_t$ in position $j_i$ is beneficial (i.e., it leads to a reduction
of loss) if and only if $p_{i,t} > c(j_i, |\hat{Y}_t|) / \left( \frac{a}{1-a} + c(j_i, |\hat{Y}_t|) \right)$.$^4$ Hence, the higher is the
slot $j_i$ in $\hat{Y}_t$ the larger should be $p_{i,t}$ in order for this inclusion to be convenient. It is
$Y_t^*$ that we interpret as the true set of user preferences on $x_t$.

We would like to compete against the above $Y_t^*$ in a cumulative regret sense, i.e.,
we would like to bound $R_T = \sum_{t=1}^T E_t[\ell_{a,c}(Y_t, \hat{Y}_t)] - E_t[\ell_{a,c}(Y_t, Y_t^*)]$ with high
probability. Inspired by [5], we devise an online second-order descent algorithm whose
updating rule makes the comparison vector $U = (u_1, \ldots, u_K) \in \mathbb{R}^{dK}$ defined through
(2) be Bayes optimal w.r.t. a surrogate convex loss $L(\cdot)$ such that $g(\Delta) = -L'(\Delta)$.
Observe that the expected loss function (3) is, generally speaking, nonconvex in the
margins $\Delta_{i,t}$ (consider, for instance, the logistic case $g(\Delta) = \frac{1}{e^{\Delta} + 1}$). Thus, we cannot
directly minimize this expected loss.



3 Algorithm and regret bounds 

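In Figure 1 is our bandit algorithm for (ordered) multiple labels. The algorithm is
based on replacing the unknown model vectors $u_1, \ldots, u_K$ with prototype vectors
$w'_{1,t}, \ldots, w'_{K,t}$, being $w'_{i,t}$ the time-$t$ approximation to $u_i$, satisfying similar
constraints we set for the $u_i$ vectors. For the sake of brevity, we let $\hat{\Delta}'_{i,t} = x_t^\top w'_{i,t}$,
and $\Delta_{i,t} = u_i^\top x_t$, $i \in [K]$. The algorithm uses $\hat{\Delta}'_{i,t}$ as proxies for the underlying
$\Delta_{i,t}$ according to the (upper confidence) approximation scheme $\Delta_{i,t} \approx [\hat{\Delta}'_{i,t} + \epsilon_{i,t}]_D$,
where $\epsilon_{i,t} \geq 0$ is a suitable upper-confidence level for class $i$ at time $t$, and $[\cdot]_D$ denotes
the clipping-to-$D$ operation, i.e., $[x]_D = \max(\min(x, R), -R)$. The algorithm's
prediction at time $t$ has the same form as the computation of the Bayes optimal sequence
$Y_t^*$, where we replace the true (and unknown) $p_{i,t} = \frac{g(-\Delta_{i,t})}{g(\Delta_{i,t}) + g(-\Delta_{i,t})}$ with the
corresponding upper confidence proxy
$\hat{p}_{i,t} = \frac{g(-[\hat{\Delta}'_{i,t} + \epsilon_{i,t}]_D)}{g([\hat{\Delta}'_{i,t} + \epsilon_{i,t}]_D) + g(-[\hat{\Delta}'_{i,t} + \epsilon_{i,t}]_D)}$. Computing
$\hat{Y}_t$ can be done by mimicking the computation of the Bayes optimal $Y_t^*$ (just replace
$p_{i,t}$ by $\hat{p}_{i,t}$), i.e., order of $K \log K$ running time per prediction. Thus the algorithm is
producing a ranked list of relevant classes based on upper-confidence-corrected scores
$\hat{p}_{i,t}$. Class $i$ is deemed relevant and ranked high among the relevant ones when either
$\hat{\Delta}'_{i,t}$ is a good approximation to $\Delta_{i,t}$ and $p_{i,t}$ is large, or when the algorithm is not very
confident on its own approximation about $i$ (that is, the upper confidence level $\epsilon_{i,t}$ is
large).

$^3$ Due to space limitations, all proofs are given in the supplementary material.
$^4$ Notice that this depends on the actual size of $\hat{Y}_t$, so we cannot decompose this problem into $K$
independent problems. The decomposition does occur if the costs $c(i, s)$ are constants, independent of $i$ and $s$,
and the criterion for inclusion becomes $p_{i,t} > \theta$, for some constant threshold $\theta$.

Parameters: loss parameters $a \in [0, 1]$, cost values $c(i, s)$, interval $D = [-R, R]$,
function $g : D \to \mathbb{R}$, confidence level $\delta \in [0, 1]$.

Initialization: $A_{i,0} = I \in \mathbb{R}^{d \times d}$, $i = 1, \ldots, K$; $w_{i,1} = 0 \in \mathbb{R}^d$, $i = 1, \ldots, K$;

For $t = 1, 2, \ldots, T$:

1. Get instance $x_t \in \mathbb{R}^d$ : $\|x_t\| = 1$;

2. For $i \in [K]$, set $\hat{\Delta}'_{i,t} = x_t^\top w'_{i,t}$, where

   $w'_{i,t} = w_{i,t}$ if $w_{i,t}^\top x_t \in [-R, R]$,
   $w'_{i,t} = w_{i,t} - \frac{w_{i,t}^\top x_t - R\,\mathrm{sign}(w_{i,t}^\top x_t)}{x_t^\top A_{i,t-1}^{-1} x_t}\, A_{i,t-1}^{-1} x_t$ otherwise;

3. Output

   $\hat{Y}_t = \mathrm{argmin}_{Y = (j_1, j_2, \ldots, j_{|Y|}) \subseteq [K]} \sum_{i \in Y} \left( c(j_i, |Y|) - \left( \frac{a}{1-a} + c(j_i, |Y|) \right) \hat{p}_{i,t} \right)$,

   where $\hat{p}_{i,t} = \frac{g(-[\hat{\Delta}'_{i,t} + \epsilon_{i,t}]_D)}{g([\hat{\Delta}'_{i,t} + \epsilon_{i,t}]_D) + g(-[\hat{\Delta}'_{i,t} + \epsilon_{i,t}]_D)}$ and

   $\epsilon^2_{i,t} = x_t^\top A_{i,t-1}^{-1} x_t \left( U^2 + \frac{d\, c'_L}{(c''_L)^2} \ln\left(1 + \frac{t-1}{d}\right) + \frac{12}{c''_L} \left( \frac{c'_L}{c''_L} + 3L(-R) \right) \ln \frac{K(t+4)}{\delta} \right)$;

4. Get feedback $Y_t \cap \hat{Y}_t$;

5. For $i \in [K]$, update $A_{i,t} = A_{i,t-1} + |s_{i,t}|\, x_t x_t^\top$, $w_{i,t+1} = w'_{i,t} - \frac{1}{c''_L} A_{i,t}^{-1} \nabla_{i,t}$,
   where

   $s_{i,t} = 1$ if $i \in Y_t \cap \hat{Y}_t$,
   $s_{i,t} = -1$ if $i \in \hat{Y}_t \setminus Y_t = \hat{Y}_t \setminus (Y_t \cap \hat{Y}_t)$,
   $s_{i,t} = 0$ otherwise;

   and $\nabla_{i,t} = \nabla_w L(s_{i,t}\, w^\top x_t)|_{w = w'_{i,t}} = -g(s_{i,t}\, \hat{\Delta}'_{i,t})\, s_{i,t}\, x_t$.

Figure 1: The partial feedback algorithm in the (ordered) multiple label setting.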

The algorithm receives in input the loss parameters $a$ and $c(i, s)$, the model function
$g(\cdot)$ and the associated margin domain $D = [-R, R]$, and maintains both $K$
positive definite matrices $A_{i,t}$ of dimension $d$ (initially set to the $d \times d$ identity
matrix), and $K$ weight vectors $w_{i,t} \in \mathbb{R}^d$ (initially set to the zero vector). At each time
step $t$, upon receiving the $d$-dimensional instance vector $x_t$, the algorithm uses the
weight vectors $w_{i,t}$ to compute the prediction vectors $w'_{i,t}$. These vectors can easily
be seen as the result of projecting $w_{i,t}$ onto interval $D = [-R, R]$ w.r.t. the distance
function $d_{i,t-1}$, i.e., $w'_{i,t} = \mathrm{argmin}_{w \in \mathbb{R}^d :\, w^\top x_t \in D}\, d_{i,t-1}(w, w_{i,t})$, $i \in [K]$, where
$d_{i,t}(u, w) = (u - w)^\top A_{i,t}\, (u - w)$. Vectors $w'_{i,t}$ are then used to produce
prediction values $\hat{\Delta}'_{i,t}$ involved in the upper-confidence calculation of $\hat{Y}_t \subseteq [K]$. Next,
the feedback $Y_t \cap \hat{Y}_t$ is observed, and the algorithm in Figure 1 promotes all classes
$i \in Y_t \cap \hat{Y}_t$ (sign $s_{i,t} = 1$), demotes all classes $i \in \hat{Y}_t \setminus Y_t$ (sign $s_{i,t} = -1$), and leaves
all remaining classes $i \notin \hat{Y}_t$ unchanged (sign $s_{i,t} = 0$). The update $w'_{i,t} \to w_{i,t+1}$
is based on the gradients $\nabla_{i,t}$ of a loss function $L(\cdot)$ satisfying $L'(\Delta) = -g(\Delta)$. On
the other hand, the update $A_{i,t-1} \to A_{i,t}$ uses the rank one matrix $x_t x_t^\top$. In both the
update of $w'_{i,t}$ and the one involving $A_{i,t-1}$, the reader should observe the role played
by the signs $s_{i,t}$. Finally, the constants $c'_L$ and $c''_L$ occurring in the expression for $\epsilon^2_{i,t}$
are related to smoothness properties of $L(\cdot)$; see next theorem.
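The per-class steps described above (projection, upper-confidence score, rank-one update and gradient step) can be sketched in plain Python. This is a hedged illustration, not the authors' implementation: it keeps $A^{-1}$ directly and updates it with the Sherman-Morrison formula (an implementation choice made here), takes the confidence level `eps` as an input rather than computing it, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def project(w, A_inv, x, R):
    """Projection of w onto {u : |u.x| <= R} w.r.t. (u - w)^T A (u - w)."""
    margin = dot(w, x)
    if -R <= margin <= R:
        return list(w)
    A_inv_x = matvec(A_inv, x)
    target = R if margin > 0 else -R
    step = (margin - target) / dot(x, A_inv_x)
    return [wi - step * ai for wi, ai in zip(w, A_inv_x)]


def ucb_probability(g, margin, eps, R):
    """Upper-confidence proxy for p: clip margin + eps to [-R, R], apply the link."""
    clipped = max(min(margin + eps, R), -R)
    return g(-clipped) / (g(clipped) + g(-clipped))


def rank_one_update(A_inv, x):
    """Sherman-Morrison: inverse of (A + x x^T) given a symmetric A^{-1}."""
    A_inv_x = matvec(A_inv, x)
    denom = 1.0 + dot(x, A_inv_x)
    n = len(x)
    return [[A_inv[r][c] - A_inv_x[r] * A_inv_x[c] / denom for c in range(n)]
            for r in range(n)]


def update_class(w_proj, A_inv, x, s, g, c2):
    """Update for one class; s is +1 (promote), -1 (demote) or 0 (untouched)."""
    if s == 0:
        # |s| = 0 leaves A unchanged and the gradient vanishes.
        return list(w_proj), A_inv
    A_inv = rank_one_update(A_inv, x)
    margin = dot(w_proj, x)
    gradient = [-g(s * margin) * s * xi for xi in x]
    direction = matvec(A_inv, gradient)
    w_next = [wi - di / c2 for wi, di in zip(w_proj, direction)]
    return w_next, A_inv


# One promoted class with the square loss (g(delta) = 1 - delta, c''_L = 1).
identity = [[1.0, 0.0], [0.0, 1.0]]
w_after, A_inv_after = update_class(
    [0.0, 0.0], identity, [1.0, 0.0], s=1, g=lambda m: 1.0 - m, c2=1.0)
```

After a single promotion along $x = (1, 0)$ the margin of the class moves from $0$ to $0.5$; a demotion with `s=-1` moves it to $-0.5$.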

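Theorem 2. Let $L : D = [-R, R] \subseteq \mathbb{R} \to \mathbb{R}^+$ be a $C^2(D)$ convex and
nonincreasing function of its argument, $(u_1, \ldots, u_K) \in \mathbb{R}^{dK}$ be defined in (2) with
$g(\Delta) = -L'(\Delta)$ for all $\Delta \in D$, and such that $\|u_i\| \leq U$ for all $i \in [K]$. Assume there
are positive constants $c_L$, $c'_L$ and $c''_L$ such that:
i. $\frac{L'(\Delta)\, L''(-\Delta) + L''(\Delta)\, L'(-\Delta)}{(L'(\Delta) + L'(-\Delta))^2} \geq -c_L$,
ii. $(L'(\Delta))^2 \leq c'_L$, and iii. $L''(\Delta) \geq c''_L$ hold for all $\Delta \in D$. Then the cumulative
regret $R_T$ of the algorithm in Figure 1 satisfies, with probability at least $1 - \delta$,

$$R_T = O\left( (1-a)\, c_L\, K \sqrt{T\, C\, d\, \ln\left(1 + \frac{T}{d}\right)} \right),$$

where $C = O\left( U^2 + \frac{d\, c'_L}{(c''_L)^2} \ln\left(1 + \frac{T}{d}\right) + \left( \frac{c'_L}{(c''_L)^2} + \frac{L(-R)}{c''_L} \right) \ln \frac{KT}{\delta} \right)$.

It is easy to see that when $L(\cdot)$ is the square loss $L(\Delta) = (1 - \Delta)^2/2$ and $D =
[-1, 1]$, we have $c_L = 1/2$, $c'_L = 4$ and $c''_L = 1$; when $L(\cdot)$ is the logistic loss $L(\Delta) =
\ln(1 + e^{-\Delta})$ and $D = [-R, R]$, we have $c_L = 1/4$, $c'_L \leq 1$ and $c''_L = \frac{1}{2(1 + \cosh(R))}$,
where $\cosh(x) = \frac{e^x + e^{-x}}{2}$.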
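The constants quoted for the square and logistic losses can be sanity-checked numerically from conditions i, ii and iii. The sketch below evaluates the three conditions on a uniform grid over $D$; the grid resolution and the function name `smoothness_constants` are assumptions made for illustration, and a grid search only approximates the true extrema.

```python
import math


def grid(lo, hi, n=2001):
    """Uniform grid on [lo, hi] that includes both endpoints."""
    return [lo + (hi - lo) * k / (n - 1) for k in range(n)]


def smoothness_constants(d1, d2, R):
    """Grid estimates of (c_L, c'_L, c''_L) for a loss with derivatives d1, d2."""
    points = grid(-R, R)
    # Condition i: the ratio must stay above -c_L, so c_L is minus its minimum.
    ratios = [
        (d1(x) * d2(-x) + d2(x) * d1(-x)) / (d1(x) + d1(-x)) ** 2
        for x in points
    ]
    c = -min(ratios)
    c1 = max(d1(x) ** 2 for x in points)  # condition ii: (L')^2 <= c'_L
    c2 = min(d2(x) for x in points)       # condition iii: L'' >= c''_L
    return c, c1, c2


# Square loss L(x) = (1 - x)^2 / 2 on D = [-1, 1].
square = smoothness_constants(lambda x: x - 1.0, lambda x: 1.0, R=1.0)

# Logistic loss L(x) = ln(1 + exp(-x)) on D = [-2, 2].
logistic = smoothness_constants(
    lambda x: -1.0 / (math.exp(x) + 1.0),
    lambda x: math.exp(x) / (1.0 + math.exp(x)) ** 2,
    R=2.0,
)
```

For the logistic loss the grid value of $c'_L$ is a tighter number than the bound $c'_L \leq 1$ stated in the text, as expected.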

Remark 1. A drawback of Theorem 2 is that, in order to properly set the upper confidence
levels $\epsilon_{i,t}$, we assume prior knowledge of the norm upper bound $U$. Because this
information is often unavailable, we present here a simple modification to the algorithm
that copes with this limitation. We change the definition of $\epsilon^2_{i,t}$ in Figure 1 to

$$\epsilon^2_{i,t} = \max\left\{ x_t^\top A_{i,t-1}^{-1} x_t \left( \frac{2\, d\, c'_L}{(c''_L)^2} \ln\left(1 + \frac{t-1}{d}\right) + \frac{12}{c''_L} \left( \frac{c'_L}{c''_L} + 3L(-R) \right) \ln \frac{K(t+4)}{\delta} \right),\ 4R^2 \right\}.$$

This immediately leads to the following result.

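Theorem 3. With the same assumptions and notation as in Theorem 2, if we replace
$\epsilon^2_{i,t}$ as explained above we have that, with probability at least $1 - \delta$, $R_T$ satisfies

$$R_T = O\left( (1-a)\, c_L\, K \sqrt{T\, C\, d\, \ln\left(1 + \frac{T}{d}\right)} + (1-a)\, c_L\, K\, R\, d \left( \exp\left( \frac{(c''_L)^2\, U^2}{c'_L\, d} \right) - 1 \right) \right).$$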



4 On ranking with partial feedback 

As Lemma 1 points out, when the cost values $c(i, s)$ in $\ell_{a,c}$ are strictly decreasing then
the Bayes optimal ordered sequence $Y_t^*$ on $x_t$ can be obtained by sorting classes in
decreasing values of $p_{i,t}$, and then decide on a cutoff point$^5$ induced by the loss
parameters, so as to tell relevant classes apart from irrelevant ones. In turn, because
$p(\Delta) = \frac{g(-\Delta)}{g(\Delta) + g(-\Delta)}$ is increasing in $\Delta$, this ordering corresponds to sorting classes
in decreasing values of $\Delta_{i,t}$. Now, if parameter $a$ in $\ell_{a,c}$ is very close$^6$ to 1, then
$|Y_t^*| = K$, and the algorithm itself will produce ordered subsets $\hat{Y}_t$ such that $|\hat{Y}_t| = K$.
The resulting algorithm can thus be seen as ranking all classes by decreasing
upper-confidence-corrected margins, $\hat{\Delta}_{i_1,t} \geq \hat{\Delta}_{i_2,t} \geq \ldots \geq \hat{\Delta}_{i_K,t}$, where
$\hat{\Delta}_{i,t} = [\hat{\Delta}'_{i,t} + \epsilon_{i,t}]_D$.
Moreover, it does so by receiving full feedback on the relevant classes at time $t$ (since
$Y_t \cap \hat{Y}_t = Y_t$). As is customary (e.g., [8]), one can view any multilabel assignment
$Y = (y_1, \ldots, y_K) \in \{0, 1\}^K$ as a ranking among the $K$ classes in the most natural
way: $i$ precedes $j$ if and only if $y_i > y_j$. The (unnormalized) ranking loss function
$\ell_{\mathrm{rank}}(Y, f)$ between the multilabel $Y$ and a ranking function $f : \mathbb{R}^d \to \mathbb{R}^K$,
representing degrees of class relevance sorted in a decreasing order $f_{j_1}(x_t) \geq f_{j_2}(x_t) \geq
\ldots \geq f_{j_K}(x_t)$, counts the number of class pairs that disagree in the two rankings:

$$\ell_{\mathrm{rank}}(Y, f) = \sum_{i,j \in [K] \,:\, y_i > y_j} \left( \{f_i(x_t) < f_j(x_t)\} + \tfrac{1}{2} \{f_i(x_t) = f_j(x_t)\} \right),$$

where $\{\ldots\}$ is the indicator function of the predicate at argument. As pointed out in [8], the
ranking function $f(x_t) = (p_{1,t}, \ldots, p_{K,t})$ is also Bayes optimal w.r.t. $\ell_{\mathrm{rank}}(Y, f)$, no
matter if the class labels $y_i$ are conditionally independent or not. Hence we can use
this algorithm for tackling ranking problems derived from multilabel ones, when the
measure of choice is $\ell_{\mathrm{rank}}$ and the feedback is full.

$^5$ This is called the zero point in [11].
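The unnormalized ranking loss is easy to state in code. The following is a minimal sketch assuming labels are given as a 0/1 list and scores as a list of floats of the same length; the function name `rank_loss` is illustrative.

```python
def rank_loss(y, f):
    """Count relevant/irrelevant pairs that the scores f order incorrectly.

    y: list of 0/1 labels, one per class.
    f: list of real-valued relevance scores, one per class.
    A pair (i, j) with y[i] > y[j] costs 1 if f[i] < f[j] and 1/2 on a tie.
    """
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                if f[i] < f[j]:
                    loss += 1.0
                elif f[i] == f[j]:
                    loss += 0.5
    return loss


# Classes 0 and 2 are relevant; class 2 is scored below class 1 and tied with class 3.
example_rank_loss = rank_loss([1, 0, 1, 0], [0.9, 0.8, 0.1, 0.1])
```

In this example one pair is inverted and one is tied, so the loss is $1 + 1/2$.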

In fact, a partial information version of the above can easily be obtained. Suppose
that at each time $t$, the environment discloses both $x_t$ and a maximal size $S_t$ for the
ordered subset $\hat{Y}_t = (j_1, j_2, \ldots, j_{|\hat{Y}_t|})$ (both $x_t$ and $S_t$ can be chosen adaptively by an
adversary). Here $S_t$ might be the number of available slots in a webpage or the number
of URLs returned by a search engine in response to query $x_t$. Then it is plausible to
compete in a regret sense against the best time-$t$ offline ranking of the form $f(x_t) =
(f_1(x_t), f_2(x_t), \ldots, f_h(x_t), 0, \ldots, 0)$, with $h \leq S_t$. Further, the ranking loss could
be reasonably restricted to count the number of class pairs disagreeing within $\hat{Y}_t$ plus
the number of false negative mistakes. E.g., if $f_{j_1}(x_t) \geq f_{j_2}(x_t) \geq \ldots \geq f_{j_{|\hat{Y}_t|}}(x_t)$,
we can set

$$\ell_{\mathrm{rank},t}(Y, f) = \sum_{i,j \in \hat{Y}_t \,:\, y_i > y_j} \left( \{f_i(x_t) < f_j(x_t)\} + \tfrac{1}{2} \{f_i(x_t) = f_j(x_t)\} \right) + |Y_t \setminus \hat{Y}_t|.$$

It is not hard to see that the Bayes optimal ranking for $\ell_{\mathrm{rank},t}$ is given by $f^*(x_t; S_t) =
(p_{i_1,t}, \ldots, p_{i_{S_t},t}, 0, \ldots, 0)$. If we put on the argmin (Step 3 in Figure 1) the further
constraint $|Y| \leq S_t$ (notice that the computation is still about sorting classes according
to decreasing values of $\hat{p}_{i,t}$), one can prove the following ranking counterpart to
Theorem 2.

Theorem 4. With the same assumptions and notation as in Theorem 2, let the cumulative
regret $R_T$ w.r.t. $\ell_{\mathrm{rank},t}$ be defined as

$$R_T = \sum_{t=1}^T E_t[\ell_{\mathrm{rank},t}(Y_t, (\hat{p}_{j_1,t}, \ldots, \hat{p}_{j_{S_t},t}, 0, \ldots, 0))] - E_t[\ell_{\mathrm{rank},t}(Y_t, (p_{i_1,t}, \ldots, p_{i_{S_t},t}, 0, \ldots, 0))],$$

where $\hat{p}_{j_1,t} \geq \ldots \geq \hat{p}_{j_{S_t},t} \geq 0$ and $p_{i_1,t} \geq \ldots \geq p_{i_{S_t},t} \geq 0$. Then, with
probability at least $1 - \delta$, we have $R_T = O\left( c_L \sqrt{S\, K\, T\, C\, d\, \ln\left(1 + \frac{T}{d}\right)} \right)$, where $S =
\max_{t = 1, \ldots, T} S_t$.

$^6$ If $a = 1$, the algorithm only cares about false negative mistakes, the best strategy being always
predicting $\hat{Y}_t = [K]$. Unsurprisingly, this yields zero regret in both Theorems 2 and 3.

The proof (see the appendix) is very similar to the one of Theorem 2. This suggests
that, to some extent, we are decoupling the label generating model from the loss
function $\ell$ under consideration. Notice that the linear dependence on the total number
of classes $K$ (which is often much larger than $S$ in a multilabel/ranking problem) is
replaced by $\sqrt{SK}$. One could get similar benefits out of Theorem 2. Finally, one could
also combine Theorem 4 with the argument contained in Remark 1.

5 Experiments and conclusions 

The experiments we report here are meant to validate the exploration-exploitation
tradeoff implemented by our algorithm under different conditions (restricted vs.
nonrestricted number of classes), loss measures ($\ell_{a,c}$, $\ell_{\mathrm{rank},t}$, and Hamming loss) and
model/parameter settings ($L$ = square loss, $L$ = logistic loss, with varying $R$).

Datasets. We used two multilabel datasets. The first one, called Mediamill, was
introduced in a video annotation challenge [26]. It comprises 30,993 training samples
and 12,914 test ones. The number of features $d$ is 120, and the number of classes $K$ is
101. The second dataset is Sony CSL Paris [22], made up of 16,452 train samples and
16,519 test samples, each sample being described by $d = 98$ features. The number of
classes $K$ is 632. In both cases, feature vectors have been normalized to unit L2 norm.

Parameter setting and loss measures. We used the algorithm in Figure 1 with two
different loss functions, the square loss and the logistic loss, and varied the parameter
$R$ for the latter. The setting of the cost function $c(i, s)$ depends on the task at hand,
and for these preliminary experiments we decided to evaluate two possible settings only.
The first one, denoted by "decreasing c", is $c(i, s) = \frac{s - i + 1}{s}$, $i = 1, \ldots, s$; the second
one, denoted by "constant c", is $c(i, s) = 1$, for all $i$ and $s$. In all experiments, the $a$
parameter was set to 0.5, so that $\ell_{a,c}$ with constant c reduces to half the Hamming loss.
In the decreasing c scenario, we evaluated the performance of the algorithm on the loss
$\ell_{a,c}$ that the algorithm is minimizing, but also its ability to produce meaningful (partial)
rankings through $\ell_{\mathrm{rank},t}$. On the constant c setting, we evaluated the Hamming loss.
As is typical of multilabel problems, the label density, i.e., the average fraction of labels
associated with the examples, is quite small. For instance, on Mediamill this is 4.3%.
Hence, it is clearly beneficial to impose an upper bound $S$ on $|\hat{Y}_t|$. For the constant c
and ranking loss experiments we tried out different values of $S$, and reported the final
performance.

Baseline. As baseline, we considered a full information version of the algorithm in Figure 1 using
the square loss, that receives after each prediction the full array of true labels $Y_t$ for
each sample. We call this algorithm OBR (Online Binary Relevance), because it is
a natural online adaptation of the binary relevance algorithm, widely used as a
baseline in the multilabel literature. Comparing to OBR stresses the effectiveness of the
exploration/exploitation rule above and beyond the details of underlying generalized
linear predictor. OBR was used to produce subsets (as in the Hamming loss case), and
restricted rankings (as in the case of $\ell_{\mathrm{rank},t}$).

Figure 2: Experiments on the Sony CSL Paris dataset.

Figure 3: Experiments on the Mediamill dataset.

Results. Our results are summarized in Figures 2 and 3. The algorithms have been
trained by sweeping only once over the training data. Though preliminary in nature,
these experiments allow us to draw a few conclusions. Our results for the average
$\ell_{a,c}(Y_t, \hat{Y}_t)$ with decreasing c are contained in the two left plots. We can see that the
performance is improving over time on both datasets, as predicted by Theorem 2. In
the middle plots are the final cumulative Hamming losses with constant c divided by
the number of training samples, as a function of $S$. Similar plots are on the right with
the final average ranking losses $\ell_{\mathrm{rank},t}$. In both cases we see that there is an optimal
value of $S$ that allows the algorithm to balance exploration and exploitation.
Moreover the performance of our algorithm is always pretty close to the performance of
OBR, even if our algorithm is receiving only partial feedback. In many experiments the
square loss seems to give better results. An exception is the ranking loss on the Mediamill
dataset (Figure 3, right).

Conclusions. We have used generalized linear models to formalize the exploration-exploitation
tradeoff in a multilabel/ranking setting with partial feedback, providing
$T^{1/2}$-like regret bounds under semi-adversarial settings. Our analysis decouples the
multilabel/ranking loss at hand from the label-generation model. Thanks to the usage
of calibrated score values $\hat p_{i,t}$, our algorithm is capable of automatically inferring where
to split the ranking between relevant and nonrelevant classes [11], the split being clearly
induced by the loss parameters in $\ell_{a,c}$. We are planning on using more general label
models that explicitly capture label correlations to be applied to other loss functions
(e.g., F-measure, 0/1, average precision, etc.). We are also planning on carrying out
a more thorough experimental comparison, especially to full information multilabel
methods that take such correlations into account. Finally, we are currently working
on extending our framework to structured output tasks, like (multilabel) hierarchical
classification.

6 Appendix 

This appendix contains the proofs of all lemmas and theorems presented in the main 
text. 

Proof: [Lemma 1] First observe that, for any given size $s$, the sequence $Y^*_{s,t}$ must contain
the $s$ top-ranked classes in the sorted order of $p_{i,t}$. This is because, for any candidate
sequence $\hat Y_s = (j_1, j_2, \dots, j_s)$, we have, up to the positive factor $1-a$ and an additive term that does not depend on $\hat Y_s$,

$$\mathbb{E}_t[\ell_{a,c}(Y_t,\hat Y_s)] = \sum_{i\in \hat Y_s}\Big(c(j_i,s) - \Big(\frac{a}{1-a} + c(j_i,s)\Big)\,p_{i,t}\Big).$$

If there exists $i \in \hat Y_s$ which is not among the $s$ top-ranked ones, then we could replace
class $i$ in position $j_i$ within $\hat Y_s$ with class $k \notin \hat Y_s$ such that $p_{k,t} > p_{i,t}$, obtaining a
smaller loss.

Next, we show that the optimal ordering within $Y^*_{s,t}$ is precisely ruled by the
nonincreasing order of $p_{i,t}$. For the sake of contradiction, assume there are $i$ and $k$ in $Y^*_{s,t}$
such that $i$ precedes $k$ in $Y^*_{s,t}$ but $p_{k,t} > p_{i,t}$. Specifically, let $i$ be in position $j_1$ and
$k$ be in position $j_2$ with $j_1 < j_2$ and such that $c(j_1,s) > c(j_2,s)$. Then switching the
two classes within $Y^*_{s,t}$ yields an expected loss difference of

$$c(j_1,s) - \Big(\frac{a}{1-a} + c(j_1,s)\Big)p_{i,t} + c(j_2,s) - \Big(\frac{a}{1-a} + c(j_2,s)\Big)p_{k,t}$$
$$\quad - \Big(c(j_1,s) - \Big(\frac{a}{1-a} + c(j_1,s)\Big)p_{k,t}\Big) - \Big(c(j_2,s) - \Big(\frac{a}{1-a} + c(j_2,s)\Big)p_{i,t}\Big)$$
$$= (p_{k,t} - p_{i,t})\,\big(c(j_1,s) - c(j_2,s)\big) > 0 .$$

Hence switching would yield a smaller loss, a contradiction. Consequently, the classes in $Y^*_{s,t}$ are sorted in nonincreasing order of $p_{i,t}$.
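
The structure established by this proof (top-ranked classes, sorted by nonincreasing $p_{i,t}$) suggests a simple way to compute a Bayes optimal ordered subset: sort once, then compare the $K+1$ prefixes. The sketch below is a hedged illustration, not the paper's code. It assumes the loss has the form $a\,|Y_t\setminus\hat Y| + (1-a)\sum_{i\in\hat Y\setminus Y_t} c(j_i,|\hat Y|)$, which is consistent with the expected-loss expression used above, and the names `expected_loss`, `bayes_optimal` and `brute_force` are hypothetical.

```python
from itertools import permutations


def expected_loss(seq, p, a, c):
    """Expected loss of predicting the ordered sequence `seq` (assumed loss form).

    p[i] is the probability that label i is relevant; c(j, s) is the cost of a
    false positive in position j (1-based) of a list of size s.
    """
    s = len(seq)
    missed = a * sum(p[i] for i in range(len(p)) if i not in seq)
    false_pos = (1 - a) * sum(c(j, s) * (1 - p[i]) for j, i in enumerate(seq, start=1))
    return missed + false_pos


def bayes_optimal(p, a, c):
    """Best prefix of the labels sorted by nonincreasing probability."""
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    candidates = [order[:s] for s in range(len(p) + 1)]
    return min(candidates, key=lambda seq: expected_loss(seq, p, a, c))


def brute_force(p, a, c):
    """Exhaustive search over all ordered subsets (only feasible for tiny K)."""
    K = len(p)
    all_seqs = (seq for s in range(K + 1) for seq in permutations(range(K), s))
    return list(min(all_seqs, key=lambda seq: expected_loss(seq, p, a, c)))
```

For costs that decrease with the position, the prefix search and the exhaustive search attain the same expected loss, while the former needs only a sort and $K+1$ evaluations.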

The algorithm in Figure 1 works by updating through the gradients $\nabla_{i,t}$ of a modular
margin-based loss function $\sum_{i=1}^K L(w_i^\top x)$ associated with the label generation
model (2), so as to make the parameters $(u_1,\dots,u_K) \in \mathbb{R}^{dK}$ therein achieve the
Bayes optimality condition

$$(u_1,\dots,u_K) = \arg\min_{w_1,\dots,w_K\,:\,w_i^\top x_t \in D}\ \mathbb{E}_t\Big[\sum_{i=1}^K L(s_{i,t}\,w_i^\top x_t)\Big], \qquad (4)$$

where $\mathbb{E}_t[\cdot]$ above is over the generation of $Y_t$ in producing the sign value $s_{i,t} \in \{-1,0,+1\}$,
conditioned on the past (in particular, conditioned on $\hat Y_t$). The requirement
in (4) is akin to the classical construction of proper scoring rules in the statistical
literature (e.g., [23]).

The following lemma addresses the problem of hand-crafting a convenient loss function
$L(\cdot)$ such that (4) holds.

Lemma 5. Let $w_1,\dots,w_K \in \mathbb{R}^d$ be arbitrary weight vectors such that $w_i^\top x_t \in D$,
$i \in [K]$, $(u_1,\dots,u_K) \in \mathbb{R}^{dK}$ be defined in (2), $s_{i,t}$ be the updating signs computed
by the algorithm at the end (Step 5) of time $t$, $L : D = [-R,R] \subseteq \mathbb{R} \to \mathbb{R}^+$ be a
convex and differentiable function of its argument, with $g(\Delta) = -L'(\Delta)$. Then for any
$t$ we have

$$\mathbb{E}_t\Big[\sum_{i=1}^K L(s_{i,t}\,w_i^\top x_t)\Big] \ \ge\ \mathbb{E}_t\Big[\sum_{i=1}^K L(s_{i,t}\,u_i^\top x_t)\Big],$$

i.e., (4) holds.

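Proof: Let us introduce the shorthands $\Delta_i = u_i^\top x_t$, $\hat\Delta_i = w_i^\top x_t$, $s_i = s_{i,t}$, and
$p_i = \mathbb{P}(y_{i,t} = 1 \mid x_t) = \frac{L'(-\Delta_i)}{L'(\Delta_i)+L'(-\Delta_i)}$. Moreover, let $\mathbb{P}_t(\cdot)$ be an abbreviation for
the conditional probability $\mathbb{P}(\cdot \mid (y_1,x_1),\dots,(y_{t-1},x_{t-1}),x_t)$. Recalling the way $s_{i,t}$
is constructed (Figure 1), we can write

$$\mathbb{E}_t\Big[\sum_{i=1}^K L(s_i\hat\Delta_i)\Big] = \sum_{i\in\hat Y_t}\Big(\mathbb{P}_t(s_{i,t}=1)\,L(\hat\Delta_i) + \mathbb{P}_t(s_{i,t}=-1)\,L(-\hat\Delta_i)\Big) + (K-|\hat Y_t|)\,L(0)$$
$$= \sum_{i\in\hat Y_t}\Big(p_i\,L(\hat\Delta_i) + (1-p_i)\,L(-\hat\Delta_i)\Big) + (K-|\hat Y_t|)\,L(0) .$$

For similar reasons,

$$\mathbb{E}_t\Big[\sum_{i=1}^K L(s_i\Delta_i)\Big] = \sum_{i\in\hat Y_t}\Big(p_i\,L(\Delta_i) + (1-p_i)\,L(-\Delta_i)\Big) + (K-|\hat Y_t|)\,L(0) .$$

Since $L(\cdot)$ is convex, so is $\mathbb{E}_t\big[\sum_{i=1}^K L(s_{i,t}\hat\Delta_i)\big]$ when viewed as a function of the $\hat\Delta_i$.
We have that $\frac{\partial\,\mathbb{E}_t[\sum_{i=1}^K L(s_{i,t}\hat\Delta_i)]}{\partial\hat\Delta_i} = 0$ if and only if for all $i \in \hat Y_t$ we have that $\hat\Delta_i$ satisfies

$$p_i = \frac{L'(-\hat\Delta_i)}{L'(\hat\Delta_i)+L'(-\hat\Delta_i)} .$$

Since $p_i = \frac{L'(-\Delta_i)}{L'(\Delta_i)+L'(-\Delta_i)}$, we have that $\mathbb{E}_t\big[\sum_{i=1}^K L(s_{i,t}\hat\Delta_i)\big]$ is minimized when
$\hat\Delta_i = \Delta_i$ for all $i \in [K]$. The claimed result immediately follows. □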
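
The optimality condition in this proof can be checked numerically for concrete losses. The sketch below is an illustration under assumptions made here: the square loss $L(\Delta) = (1-\Delta)^2$ and the logistic loss $L(\Delta) = \ln(1+e^{-\Delta})$ are chosen as examples, and a plain grid search stands in for the minimization. For a given margin $\Delta$ it forms $p = L'(-\Delta)/(L'(\Delta)+L'(-\Delta))$ and verifies that $p\,L(\hat\Delta) + (1-p)\,L(-\hat\Delta)$ is minimized at $\hat\Delta = \Delta$.

```python
import math


def model_prob(L_prime, delta):
    """p = L'(-delta) / (L'(delta) + L'(-delta)), as in the proof above."""
    return L_prime(-delta) / (L_prime(delta) + L_prime(-delta))


def expected_surrogate(L, p, x):
    """Per-class expected loss p * L(x) + (1 - p) * L(-x)."""
    return p * L(x) + (1 - p) * L(-x)


def grid_argmin(f, lo, hi, steps=20001):
    """Minimizer of f over an evenly spaced grid on [lo, hi]."""
    xs = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return min(xs, key=f)


# (loss, derivative) pairs; both losses are example choices
square = (lambda x: (1 - x) ** 2, lambda x: -2 * (1 - x))
logistic = (lambda x: math.log1p(math.exp(-x)), lambda x: -1 / (1 + math.exp(x)))
```

For the square loss this gives $p = (1+\Delta)/2$, and for the logistic loss $p = 1/(1+e^{-\Delta})$. In both cases the grid minimizer coincides with $\Delta$ up to the grid resolution.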
Let now $\mathrm{Var}_t(\cdot)$ be a shorthand for $\mathrm{Var}(\cdot \mid (y_1,x_1),\dots,(y_{t-1},x_{t-1}),x_t)$. The
following lemma shows that, under additional assumptions on the loss $L(\cdot)$, we can
bound the variance of a difference of losses $L(\cdot)$ by the expectation of this
difference. This will be key to proving the fast rates of convergence contained in the
subsequent Lemma 9.

Lemma 6. Let $(w'_{1,t},\dots,w'_{K,t}) \in \mathbb{R}^{dK}$ be the weight vectors computed by the algorithm
in Figure 1 at the beginning (Step 2) of time $t$, $s_{i,t}$ be the updating signs computed
at the end (Step 5) of time $t$, and $(u_1,\dots,u_K) \in \mathbb{R}^{dK}$ be the comparison vectors
defined through (2). Let $L : D = [-R,R] \subseteq \mathbb{R} \to \mathbb{R}^+$ be a $C^2(D)$ convex function of
its argument, with $g(\Delta) = -L'(\Delta)$ and such that there are positive constants $c'_L$ and
$c''_L$ with $(L'(\Delta))^2 \le c'_L$ and $L''(\Delta) \ge c''_L$ for all $\Delta \in D$. Then for any $i \in \hat Y_t$

$$0 \le \mathrm{Var}_t\big(L(s_{i,t}\,x_t^\top w'_{i,t}) - L(s_{i,t}\,u_i^\top x_t)\big) \le \frac{2c'_L}{c''_L}\,\mathbb{E}_t\big[L(s_{i,t}\,x_t^\top w'_{i,t}) - L(s_{i,t}\,u_i^\top x_t)\big] .$$



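Proof: Let us introduce the shorthands $\Delta_i = x_t^\top u_i$, $\hat\Delta_i = x_t^\top w'_{i,t}$, $s_i = s_{i,t}$, and
$p_i = \mathbb{P}(y_{i,t} = 1 \mid x_t) = \frac{L'(-\Delta_i)}{L'(\Delta_i)+L'(-\Delta_i)}$. Then, for any $i \in [K]$,

$$\mathrm{Var}_t\big(L(s_{i,t}\,x_t^\top w'_{i,t}) - L(s_{i,t}\,u_i^\top x_t)\big) \le \mathbb{E}_t\Big[\big(L(s_i\hat\Delta_i) - L(s_i\Delta_i)\big)^2\Big] \le c'_L\,(\hat\Delta_i - \Delta_i)^2 . \qquad (5)$$

Moreover, for any $i \in \hat Y_t$ we can write

$$\mathbb{E}_t\big[L(s_i\hat\Delta_i) - L(s_i\Delta_i)\big] = p_i\big(L(\hat\Delta_i) - L(\Delta_i)\big) + (1-p_i)\big(L(-\hat\Delta_i) - L(-\Delta_i)\big)$$
$$\ge p_i\Big(L'(\Delta_i)(\hat\Delta_i - \Delta_i) + \frac{c''_L}{2}(\hat\Delta_i - \Delta_i)^2\Big) + (1-p_i)\Big(L'(-\Delta_i)(\Delta_i - \hat\Delta_i) + \frac{c''_L}{2}(\hat\Delta_i - \Delta_i)^2\Big)$$
$$= p_i\,\frac{c''_L}{2}(\hat\Delta_i - \Delta_i)^2 + (1-p_i)\,\frac{c''_L}{2}(\hat\Delta_i - \Delta_i)^2$$
$$= \frac{c''_L}{2}(\hat\Delta_i - \Delta_i)^2 , \qquad (6)$$

where the second equality uses the definition of $p_i$. Combining (5) with (6) gives the
desired bound. □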
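
As a sanity check of this variance bound, the sketch below evaluates both sides in closed form for one concrete loss. Everything specific here is an assumption made for illustration: the square loss $L(\Delta) = (1-\Delta)^2$ on $D = [-R,R]$ with $R \le 1$, for which $p_i = (1+\Delta_i)/2$, $c'_L = 4(1+R)^2$ and $c''_L = 2$, and a class $i \in \hat Y_t$ whose sign is $+1$ with probability $p_i$ and $-1$ otherwise.

```python
import random


def check_variance_bound(delta, delta_hat, R):
    """Return (variance, expectation, bound) of the loss difference for the square loss.

    Assumes L(x) = (1 - x)^2 on [-R, R] with R <= 1 and a class inside the
    predicted set, so that s = +1 with probability p and s = -1 otherwise.
    """
    L = lambda x: (1 - x) ** 2
    p = (1 + delta) / 2                    # label probability induced by the square loss
    c1 = 4 * (1 + R) ** 2                  # upper bound on (L')^2 over [-R, R]
    c2 = 2.0                               # second derivative of the square loss
    x_pos = L(delta_hat) - L(delta)        # loss difference when s = +1
    x_neg = L(-delta_hat) - L(-delta)      # loss difference when s = -1
    mean = p * x_pos + (1 - p) * x_neg
    var = p * x_pos ** 2 + (1 - p) * x_neg ** 2 - mean ** 2
    return var, mean, (2 * c1 / c2) * mean
```

For this loss the expectation equals $(\hat\Delta_i - \Delta_i)^2$ exactly, so inequality (6) holds with equality, and the variance stays below the stated multiple of it.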
We continue by showing a one-step regret bound for our original loss $\ell_{a,c}$. The precise
connection to loss $L(\cdot)$ will be established with the help of a later lemma (Lemma 9).

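Lemma 7. Let $L : D = [-R,R] \subseteq \mathbb{R} \to \mathbb{R}^+$ be a convex, twice differentiable, and
nonincreasing function of its argument. Let $(u_1,\dots,u_K) \in \mathbb{R}^{dK}$ be defined in (2)
with $g(\Delta) = -L'(\Delta)$ for all $\Delta \in D$. Let also $c_L$ be a positive constant such that

$$\frac{L'(\Delta)\,L''(-\Delta) + L''(\Delta)\,L'(-\Delta)}{(L'(\Delta) + L'(-\Delta))^2} \ \ge\ -c_L$$

holds for all $\Delta \in D$. Finally, let $\Delta_{i,t}$ denote $u_i^\top x_t$, and $\hat\Delta'_{i,t}$ denote $x_t^\top w'_{i,t}$, where
$w'_{i,t}$ is the $i$-th weight vector computed by the algorithm at the beginning (Step 2) of
time $t$. If time $t$ is such that $|\Delta_{i,t} - \hat\Delta'_{i,t}| \le \varepsilon_{i,t}$ for all $i \in [K]$, then

$$\mathbb{E}_t[\ell_{a,c}(Y_t,\hat Y_t)] - \mathbb{E}_t[\ell_{a,c}(Y_t,Y^*_t)] \ \le\ 2(1-a)\,c_L \sum_{i\in\hat Y_t}\varepsilon_{i,t} .$$

Proof: Introduce the shorthand notation $p(\Delta) = \frac{g(-\Delta)}{g(\Delta)+g(-\Delta)}$. We can write

$$\mathbb{E}_t[\ell_{a,c}(Y_t,\hat Y_t)] - \mathbb{E}_t[\ell_{a,c}(Y_t,Y^*_t)]$$
$$= (1-a)\sum_{i\in\hat Y_t}\Big(c(\hat j_i,|\hat Y_t|) - \Big(\frac{a}{1-a} + c(\hat j_i,|\hat Y_t|)\Big)\,p(\Delta_{i,t})\Big)$$
$$\quad - (1-a)\sum_{i\in Y^*_t}\Big(c(j^*_i,|Y^*_t|) - \Big(\frac{a}{1-a} + c(j^*_i,|Y^*_t|)\Big)\,p(\Delta_{i,t})\Big) ,$$

where $\hat j_i$ denotes the position of class $i$ in $\hat Y_t$ and $j^*_i$ is the position of class $i$ in $Y^*_t$.
Now,

$$p'(\Delta) = \frac{-g'(-\Delta)\,g(\Delta) - g'(\Delta)\,g(-\Delta)}{(g(\Delta)+g(-\Delta))^2} = \frac{-L'(\Delta)\,L''(-\Delta) - L'(-\Delta)\,L''(\Delta)}{(L'(\Delta)+L'(-\Delta))^2} \ \ge\ 0$$

since $g(\Delta) = -L'(\Delta)$, and $L(\cdot)$ is convex and nonincreasing. Hence $p(\Delta)$ is itself
a nondecreasing function of $\Delta$. Moreover, the extra condition on $L$ involving $L'$ and
$L''$ is a Lipschitz condition on $p(\Delta)$ via a uniform bound on $p'(\Delta)$. Hence, from
$|\Delta_{i,t} - \hat\Delta'_{i,t}| \le \varepsilon_{i,t}$ and the definition of $\hat Y_t$ we can write

$$\mathbb{E}_t[\ell_{a,c}(Y_t,\hat Y_t)] - \mathbb{E}_t[\ell_{a,c}(Y_t,Y^*_t)]$$
$$\le (1-a)\sum_{i\in\hat Y_t}\Big(c(\hat j_i,|\hat Y_t|) - \Big(\frac{a}{1-a} + c(\hat j_i,|\hat Y_t|)\Big)\,p([\hat\Delta'_{i,t} - \varepsilon_{i,t}]_D)\Big)$$
$$\quad - (1-a)\sum_{i\in Y^*_t}\Big(c(j^*_i,|Y^*_t|) - \Big(\frac{a}{1-a} + c(j^*_i,|Y^*_t|)\Big)\,p([\hat\Delta'_{i,t} + \varepsilon_{i,t}]_D)\Big)$$
$$\le (1-a)\sum_{i\in\hat Y_t}\Big(c(\hat j_i,|\hat Y_t|) - \Big(\frac{a}{1-a} + c(\hat j_i,|\hat Y_t|)\Big)\,p([\hat\Delta'_{i,t} - \varepsilon_{i,t}]_D)\Big)$$
$$\quad - (1-a)\sum_{i\in\hat Y_t}\Big(c(\hat j_i,|\hat Y_t|) - \Big(\frac{a}{1-a} + c(\hat j_i,|\hat Y_t|)\Big)\,p([\hat\Delta'_{i,t} + \varepsilon_{i,t}]_D)\Big)$$
$$= (1-a)\sum_{i\in\hat Y_t}\Big(c(\hat j_i,|\hat Y_t|)\,\big(p([\hat\Delta'_{i,t} + \varepsilon_{i,t}]_D) - p([\hat\Delta'_{i,t} - \varepsilon_{i,t}]_D)\big)\Big)$$
$$\le 2(1-a)\,c_L\sum_{i\in\hat Y_t}\varepsilon_{i,t} ,$$

the last inequality deriving from $c(i,s) \le 1$ for all $i \le s \le K$, and

$$p([\hat\Delta'_{i,t} + \varepsilon_{i,t}]_D) - p([\hat\Delta'_{i,t} - \varepsilon_{i,t}]_D) \ \le\ c_L\big([\hat\Delta'_{i,t} + \varepsilon_{i,t}]_D - [\hat\Delta'_{i,t} - \varepsilon_{i,t}]_D\big) \ \le\ 2\,c_L\,\varepsilon_{i,t} . \qquad □$$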
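
As a worked illustration of the Lipschitz constant $c_L$ appearing in Lemma 7, the following computes it for two concrete losses. The choice of losses is ours, made only for the example; the definitions of $p(\Delta)$ and of the condition on $c_L$ are those of the lemma.

```latex
% Square loss L(\Delta) = (1-\Delta)^2 on D = [-R, R] with R \le 1:
L'(\Delta) = -2(1-\Delta), \qquad L''(\Delta) = 2, \qquad
p(\Delta) = \frac{L'(-\Delta)}{L'(\Delta)+L'(-\Delta)} = \frac{1+\Delta}{2},
\qquad
\frac{L'(\Delta)L''(-\Delta)+L''(\Delta)L'(-\Delta)}{(L'(\Delta)+L'(-\Delta))^2}
  = \frac{-8}{16} = -\tfrac12
\;\Longrightarrow\; c_L = \tfrac12 .

% Logistic loss L(\Delta) = \ln(1+e^{-\Delta}):
L'(\Delta) = -\frac{1}{1+e^{\Delta}}, \qquad
p(\Delta) = \frac{1}{1+e^{-\Delta}}, \qquad
p'(\Delta) = p(\Delta)\,\bigl(1-p(\Delta)\bigr) \le \tfrac14
\;\Longrightarrow\; c_L = \tfrac14 .
```

In both cases $p'(\Delta) \ge 0$, in agreement with the monotonicity of $p$ used in the proof.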
Likewise, we provide a similar bound for the ranking loss. 

Lemma 8. Under the same assumptions and notation as in Lemma 7, let the Algorithm
in Figure 1 be working with $a \to 1$ and strictly decreasing cost values $c(i,s)$. Let $w'_{i,t}$
be the $i$-th weight vector computed by this algorithm at the beginning (Step 2) of time
$t$. If this algorithm ranks classes as $\hat p_{\hat j_1,t} > \dots > \hat p_{\hat j_{S_t},t} \ge 0$, and time $t$ is such that
$|\Delta_{i,t} - \hat\Delta'_{i,t}| \le \varepsilon_{i,t}$ for all $i \in [K]$, then

$$\mathbb{E}_t[\ell_{rank,t}(Y_t,(\hat p_{\hat j_1,t},\dots,\hat p_{\hat j_{S_t},t},0,\dots,0))] - \mathbb{E}_t[\ell_{rank,t}(Y_t,(p_{i_1,t},\dots,p_{i_{S_t},t},0,\dots,0))] \ \le\ 2\,S_t\,c_L\sum_{i\in\hat Y_t}\varepsilon_{i,t} ,$$

where the $p_{i,t} = \mathbb{P}_t(y_{i,t} = 1 \mid x_t)$ are sorted as $p_{i_1,t} > \dots > p_{i_{S_t},t} \ge 0$, and
$\hat Y_t = (\hat j_1,\hat j_2,\dots,\hat j_{S_t})$.

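Proof: Recall the notation $\mathbb{P}_t(\cdot) = \mathbb{P}(\cdot \mid x_t)$, and $p_{i,t} = p(\Delta_{i,t}) = \frac{g(-\Delta_{i,t})}{g(\Delta_{i,t})+g(-\Delta_{i,t})}$.
Following [8] (proof of Theorem 2 therein), one can see that for generic sequences
$(\hat p_{1,t},\dots,\hat p_{K,t})$ and $(p_{1,t},\dots,p_{K,t})$ one has

$$\mathbb{E}_t[\ell_{rank}(Y_t,(\hat p_{1,t},\dots,\hat p_{K,t}))] - \mathbb{E}_t[\ell_{rank}(Y_t,(p_{1,t},\dots,p_{K,t}))] = \sum_{i,j\in[K],\,i<j}\big(\hat r(i,j) - r(i,j) + \hat r(j,i) - r(j,i)\big) , \qquad (7)$$

where

$$\hat r(i,j) = \mathbb{P}_t(y_{i,t} > y_{j,t})\,\big(\{\hat p_{i,t} < \hat p_{j,t}\} + \tfrac12\{\hat p_{i,t} = \hat p_{j,t}\}\big)$$
$$r(i,j) = \mathbb{P}_t(y_{i,t} > y_{j,t})\,\big(\{p_{i,t} < p_{j,t}\} + \tfrac12\{p_{i,t} = p_{j,t}\}\big) .$$

Since

$$\mathbb{P}_t(y_{i,t} > y_{j,t}) - \mathbb{P}_t(y_{j,t} > y_{i,t}) = \mathbb{P}_t(y_{i,t} = 1) - \mathbb{P}_t(y_{j,t} = 1) = p_{i,t} - p_{j,t} ,$$

a simple case analysis reveals that

$$\hat r(i,j) - r(i,j) + \hat r(j,i) - r(j,i) = \begin{cases}
\tfrac12\,(p_{i,t} - p_{j,t}) & \text{if } \hat p_{i,t} < \hat p_{j,t},\ p_{i,t} = p_{j,t} \ \text{ or } \ \hat p_{i,t} = \hat p_{j,t},\ p_{i,t} > p_{j,t} \\
\tfrac12\,(p_{j,t} - p_{i,t}) & \text{if } \hat p_{i,t} = \hat p_{j,t},\ p_{i,t} < p_{j,t} \ \text{ or } \ \hat p_{i,t} > \hat p_{j,t},\ p_{i,t} = p_{j,t} \\
p_{i,t} - p_{j,t} & \text{if } \hat p_{i,t} < \hat p_{j,t},\ p_{i,t} > p_{j,t} \\
p_{j,t} - p_{i,t} & \text{if } \hat p_{i,t} > \hat p_{j,t},\ p_{i,t} < p_{j,t} ,
\end{cases}$$

which can be uniformly upper bounded by $|p_{i,t} - \hat p_{i,t}| + |p_{j,t} - \hat p_{j,t}|$.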

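We now specialize the above to the two sequences $(\hat p_{\hat j_1,t},\dots,\hat p_{\hat j_{S_t},t},0,\dots,0)$ and
$(p_{i_1,t},\dots,p_{i_{S_t},t},0,\dots,0)$, and use $\ell_{rank,t}$ instead of $\ell_{rank}$. Setting $\hat Y_t = (\hat j_1,\hat j_2,\dots,\hat j_{S_t})$
and $Y^*_t = (i_1,i_2,\dots,i_{S_t})$, and proceeding as in Lemma 7, we can write

$$\mathbb{E}_t[\ell_{rank,t}(Y_t,(\hat p_{\hat j_1,t},\dots,\hat p_{\hat j_{S_t},t}))] - \mathbb{E}_t[\ell_{rank,t}(Y_t,(p_{i_1,t},\dots,p_{i_{S_t},t}))]$$
$$\le (S_t - 1)\sum_{i\in\hat Y_t}|\hat p_{i,t} - p_{i,t}| + \sum_{i\in\hat Y_t}\big(p([\hat\Delta'_{i,t} + \varepsilon_{i,t}]_D) - p([\hat\Delta'_{i,t} - \varepsilon_{i,t}]_D)\big)$$
$$= (S_t - 1)\sum_{i\in\hat Y_t}\big|p(\Delta_{i,t}) - p([\hat\Delta'_{i,t} + \varepsilon_{i,t}]_D)\big| + \sum_{i\in\hat Y_t}\big(p([\hat\Delta'_{i,t} + \varepsilon_{i,t}]_D) - p([\hat\Delta'_{i,t} - \varepsilon_{i,t}]_D)\big)$$
$$\le (S_t - 1)\,c_L\sum_{i\in\hat Y_t}\big|\Delta_{i,t} - [\hat\Delta'_{i,t} + \varepsilon_{i,t}]_D\big| + \sum_{i\in\hat Y_t}\big(p([\hat\Delta'_{i,t} + \varepsilon_{i,t}]_D) - p([\hat\Delta'_{i,t} - \varepsilon_{i,t}]_D)\big)$$
$$\le (S_t - 1)\,c_L\sum_{i\in\hat Y_t}\big(|\Delta_{i,t} - \hat\Delta'_{i,t}| + \varepsilon_{i,t}\big) + \sum_{i\in\hat Y_t}\big(p([\hat\Delta'_{i,t} + \varepsilon_{i,t}]_D) - p([\hat\Delta'_{i,t} - \varepsilon_{i,t}]_D)\big)$$
$$\le 2\,S_t\,c_L\sum_{i\in\hat Y_t}\varepsilon_{i,t} ,$$

as claimed. □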
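
The pairwise bound used in this proof, namely that $\hat r(i,j) - r(i,j) + \hat r(j,i) - r(j,i)$ never exceeds $|p_{i,t} - \hat p_{i,t}| + |p_{j,t} - \hat p_{j,t}|$, can be checked exhaustively on a grid. The sketch below is an illustration only. It assumes independent labels so that $\mathbb{P}_t(y_{i,t} > y_{j,t}) = p_{i,t}(1-p_{j,t})$, although the argument itself only needs the difference $p_{i,t} - p_{j,t}$, and the function names are hypothetical.

```python
import itertools


def half_indicator(u, v):
    """The quantity {u < v} + 1/2 {u = v} used in the definition of r(i, j)."""
    if u < v:
        return 1.0
    return 0.5 if u == v else 0.0


def pair_excess(pi, pj, pi_hat, pj_hat):
    """r_hat(i,j) - r(i,j) + r_hat(j,i) - r(j,i) for one pair of classes.

    Assumption for this check only: labels are independent given x_t.
    """
    a = pi * (1 - pj)   # P(y_i > y_j)
    b = pj * (1 - pi)   # P(y_j > y_i)
    r_hat = a * half_indicator(pi_hat, pj_hat) + b * half_indicator(pj_hat, pi_hat)
    r = a * half_indicator(pi, pj) + b * half_indicator(pj, pi)
    return r_hat - r
```

Whenever the estimated and the true scores order the pair in the same way the excess is zero, which is why only the listed disagreement cases contribute.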

Lemma 9. Let $L : D = [-R,R] \subseteq \mathbb{R} \to \mathbb{R}^+$ be a $C^2(D)$ convex and nonincreasing
function of its argument, $(u_1,\dots,u_K) \in \mathbb{R}^{dK}$ be defined in (2) with $g(\Delta) = -L'(\Delta)$
for all $\Delta \in D$, and such that $\|u_i\| \le U$ for all $i \in [K]$. Assume there are positive
constants $c'_L$ and $c''_L$ with $(L'(\Delta))^2 \le c'_L$ and $L''(\Delta) \ge c''_L$ for all $\Delta \in D$. With the
notation introduced in Figure 1, we have that

{oAw^, - ujxr < x^All^a. {U^ + + ^) + 1(1 + '"^^"""O 

holds with probability at least $1 - \delta$ for any $\delta < 1/e$, uniformly over $i \in [K]$, $t = 1, 2, \dots$, and $x \in \mathbb{R}^d$.

Proof: For any given class i, the time-t update rule w'^ ^ — Wi t+i w[ ^J^^ in 
Figure [T] allows us to start off from lfT2l (proof of Theorem 2 therein), from which one 
can extract the following inequality 

d^^t-l{u^,w[^) 

1 2 '""^ / c" 2 

+ -rjTyi E " c" ^ y^lk{'^'i,k -Ui) - Y {s^,kxl{w'^^^ - Ui)) 
^ I^' k=l ^ k=l ^ 

(8) 



where we set ri,k = '^Jk ^i 1 '^i,k- Using the lower bound on the second derivative 
of $L$ we have

c" 

c" 2 

Plugging back into (8) yields



t-i 



d,,t_i(M,, w[ ^) <U^ + — — -^Yl xjw'ij^) - L{s,,k ujxk)) 

(9) 

We now borrow a proof technique from f5\ (see also Q H] and references therein). 
Define $L_{i,k} = L(s_{i,k}\,x_k^\top w'_{i,k}) - L(s_{i,k}\,u_i^\top x_k)$ and $L'_{i,k} = \mathbb{E}_k[L_{i,k}] - L_{i,k}$. Notice
that the sequence of random variables $L'_{i,1}, L'_{i,2}, \dots$ forms a martingale difference
sequence such that, for any $i \in \hat Y_k$:

i. $\mathbb{E}_k[L_{i,k}] \ge 0$, by Lemma 6;

ii. $|L'_{i,k}| \le 2L(-R)$, since $L(\cdot)$ is nonincreasing over $D$, and $s_{i,k}\,x_k^\top w'_{i,k},\ s_{i,k}\,u_i^\top x_k \in D$;

iii. $\mathrm{Var}_k(L'_{i,k}) = \mathrm{Var}_k(L_{i,k}) \le \frac{2c'_L}{c''_L}\,\mathbb{E}_k[L_{i,k}]$ (again, because of Lemma 6).

On the other hand, when $i \notin \hat Y_k$ then $s_{i,k} = 0$, and the above three properties are
trivially satisfied. Under the above conditions, we are in a position to apply any fast
concentration result for bounded martingale difference sequences. For instance, setting
for brevity B — B{t, 5) — 3 \n ^^-^^j^, [ 17| allows us derive the inequality 



t-i t-i 

Kk[Li,k] - Y ^i^^ - "^^^ 

fe=l k=l 



\ 



8c' 



that holds with probability at most for any t > 1. We use the inequality 

cb < ^{c + b) with c = B, and 6 = 2 X]fe=i ^k[Li.k], and simplify. This gives 

t-i 



k=l ^ ^ ' 



with probability at least 1 — J^^^^J^]^^ ■ Using the Cauchy-Schwarz inequality 
holding for any x G TZ"^, and replacing back into ^ allows us to conclude that 
12 fc'



3L{-R)j In 
(10) 

holds with probability at least 1 — jft(|qr[y, uniformly over x G TZ"^. 

The bounds on X]fe=i '^i.fc obtained in a standard way. Applying known 

inequalities (e.g., ll3ll4l[TIll7l). and using the fact that Vi.fc — L'{si,k -^^"^i k) ^i,k''^k 
we have 



t~i 



k=l 



t-1 



< 



< 



c'l Y1 l^^fel^fc A.fe^fc 
\Aik\ 



k=l 
t-1 



'.El- 



fe=i 



H,k-1\ 



c'l In 



< dc'^lnll 



t- 1 



Piecing together as in (10) and stratifying over $t = 1, 2, \dots$, and $i \in [K]$ concludes the
proof. □

We are now ready to put all pieces together.

Proof: [Theorem 2] From Lemma 7 and Lemma 9, we see that with probability at least $1-\delta$,

$$R_T \ \le\ 2(1-a)\,c_L\sum_{t=1}^T\sum_{i\in\hat Y_t}\varepsilon_{i,t} , \qquad (11)$$

when $\varepsilon^2_{i,t}$ is the one given in Figure 1. We continue by proving a pointwise upper
bound on the sum in the RHS. In more detail, we will find an upper bound on
$\sum_{t=1}^T\sum_{i\in\hat Y_t}\varepsilon^2_{i,t}$, and then derive a resulting upper bound on the RHS of (11).
From Lemma 9 and the update rule (Step 5) of the algorithm we can write

= C 



xJ{A,^t-i + \si,t\xtxJ) ^Xt 



1 - \st.t\xj {A,^t-i + \si,t\ xtxjy^x 



< C 
= C 



1 - \si^t\xj {A,^t-i + \si,t\ xtxj) ^xt 

xj A^lxt 
1 - \si^t\x'l{Ao + |si,t| Xfxjy^xt 
xj^ijxt 



— 2 C x^ A^ ^ Xi . 

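Hence, if we set $r_{i,t} = x_t^\top A_{i,t}^{-1} x_t$ and proceed as in the proof of Lemma 9, we end up
with the upper bound $\sum_{t=1}^T \varepsilon^2_{i,t} \le 2Cd\ln\big(1 + \frac{T}{d}\big)$, holding for all $i \in [K]$. Denoting
by $M$ the quantity $2Cd\ln\big(1 + \frac{T}{d}\big)$, we conclude from (11) that

$$R_T \ \le\ 2(1-a)\,c_L\,\max\Big\{\sum_{i\in[K]}\sum_{t=1}^T\varepsilon_{i,t} \ \Big|\ \sum_{t=1}^T\varepsilon^2_{i,t} \le M,\ i \in [K]\Big\} = 2(1-a)\,c_L\,K\sqrt{TM} ,$$

as claimed. □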
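
The logarithmic factor $d\ln(1 + T/d)$ in $M$ comes from a standard potential argument on the quantities $x_t^\top A_t^{-1} x_t$. The sketch below illustrates it numerically in pure Python. It assumes $A_0 = I$, unit-norm instances, and an update at every step, which are assumptions made for the illustration rather than a transcription of Figure 1. The inverse is maintained with the Sherman-Morrison formula and the determinant with the matrix determinant lemma.

```python
import math
import random


def potential_sum(xs):
    """Return (sum_t x_t^T A_t^{-1} x_t, ln det A_T), where A_0 = I and
    A_t = A_{t-1} + x_t x_t^T (assumed update at every step)."""
    d = len(xs[0])
    A_inv = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    total, log_det = 0.0, 0.0
    for x in xs:
        v = [sum(A_inv[i][j] * x[j] for j in range(d)) for i in range(d)]  # A_{t-1}^{-1} x_t
        q = sum(x[i] * v[i] for i in range(d))                             # x_t^T A_{t-1}^{-1} x_t
        # Sherman-Morrison: A_t^{-1} = A_{t-1}^{-1} - v v^T / (1 + q)
        for i in range(d):
            for j in range(d):
                A_inv[i][j] -= v[i] * v[j] / (1.0 + q)
        total += q / (1.0 + q)      # equals x_t^T A_t^{-1} x_t
        log_det += math.log1p(q)    # det A_t = det A_{t-1} * (1 + q)
    return total, log_det
```

Each summand $q/(1+q)$ is at most $\ln(1+q)$, so the sum is bounded by $\ln\det A_T$, which in turn is at most $d\ln(1 + T/d)$ for unit-norm instances. The identity $x^\top A_{t-1}^{-1}x = q_t/(1-q_t)$ with $q_t = x^\top A_t^{-1}x$ is the same Sherman-Morrison relation that links $A_{i,t-1}$ and $A_{i,t}$ in the proof above.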
Proof: [Theorem 3] As we said, we change the definition of $\varepsilon^2_{i,t}$ in the Algorithm in
Figure 1 to



First, notice that the $4R^2$ cap seamlessly applies, since $(x^\top w'_{i,t} - u_i^\top x)^2$ in Lemma 9
is bounded by $4R^2$ anyway. With this modification, we have that Theorem 2 only holds
for t such that 0^ In (l + > f/^, i.e., for t > d (cxp (^^^^ j - l) + 1, 

while for t < d (exjp (^ '^'^'"J ^ ^ — 1^ + 1 we have in the worst-case scenario the 
maximum amount of regret at each step. From Lemma 7 we see that this maximum
amount (the cap on $\varepsilon^2_{i,t}$ is needed here) can be bounded by $4(1-a)\,c_L\,|\hat Y_t|\,R \le 4(1-a)\,c_L K R$. □

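Proof: [Theorem 4] We start from the one-step regret delivered by Lemma 8 and
proceed as in the proof of Theorem 2. This yields

$$R_T \ \le\ 2\,c_L\sum_{t=1}^T S_t\sum_{i\in\hat Y_t}\varepsilon_{i,t} \ \le\ 2\,S\,c_L\sum_{t=1}^T\sum_{i\in\hat Y_t}\varepsilon_{i,t} = 2\,S\,c_L\sum_{i\in[K]}\ \sum_{t\le T\,:\,i\in\hat Y_t}\varepsilon_{i,t}$$

with probability at least $1-\delta$, where $\varepsilon^2_{i,t}$ is the one given in Figure 1. Let $M$ be as
in the proof of Theorem 2. If $N_{i,T}$ denotes the total number of times class $i$ occurs in
$\hat Y_t$, we have that $\sum_{t\le T\,:\,i\in\hat Y_t}\varepsilon^2_{i,t} \le M$, implying $\sum_{t\le T\,:\,i\in\hat Y_t}\varepsilon_{i,t} \le \sqrt{N_{i,T}\,M}$ for all $i \in [K]$.
Moreover, $\sum_{i\in[K]} N_{i,T} \le ST$. Hence

$$R_T \ \le\ 2\,S\,c_L\sum_{i\in[K]}\sqrt{N_{i,T}\,M} \ \le\ 2\,S\,c_L\sqrt{MSKT} ,$$

as claimed. □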
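
For completeness, the last step of this proof can be spelled out as follows. This is a short derivation added for illustration; it uses only the Cauchy-Schwarz inequality and the constraint $\sum_{i\in[K]} N_{i,T} \le ST$ stated above.

```latex
\sum_{i\in[K]} \sqrt{N_{i,T}\,M}
  \;\le\; \sqrt{K \sum_{i\in[K]} N_{i,T}\,M}   % Cauchy-Schwarz over the K classes
  \;\le\; \sqrt{K\,S\,T\,M}                    % since \sum_{i\in[K]} N_{i,T} \le S T
```

Multiplying by the leading factor $2\,S\,c_L$ gives the stated bound. The same argument with $N_{i,T} \le T$ for every class yields the factor $K\sqrt{TM}$.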